-
Notifications
You must be signed in to change notification settings - Fork 105
replace statsd[aemon] metrics with built-in metrics library #384
Conversation
buf = WriteUint32(buf, l.prefix, []byte("p75"), r.P75/1000, now) | ||
buf = WriteUint32(buf, l.prefix, []byte("p90"), r.P90/1000, now) | ||
buf = WriteUint32(buf, l.prefix, []byte("max"), r.Max/1000, now) | ||
buf = WriteUint32(buf, l.prefix, []byte("count"), r.Count/1000, now) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i dont think you should be dividing count by 1000
552aa64
to
6f36d4c
Compare
6bf366c
to
0a511c2
Compare
now rebased on top of clustering. made some good progress, but still a bunch of things to switch over. |
0a511c2
to
fc1fd85
Compare
6428ed6
to
a7aabea
Compare
Status:
Some interesting notes (please also read the commit messages for more details):
if you want to run it and see for yourself,it's easy:
|
log.Fatal("invalid output") | ||
} | ||
|
||
// from should either a unix timestamp, or a specification that graphite/metrictank will recognize. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
either be a
46daacc
to
f6b9b31
Compare
running as for memory used, all memory of the stats library (under ingest +http workload) is due to the graphite queue, which can be sized as desired. |
8f86076
to
975c33b
Compare
975c33b
to
0378625
Compare
we just update it every time we add or remove metrics. technically we only need the value once every second, but the previous approach would do it at a time not matching when we send metrics. Now we could synchronize them via a callback but it would be a little too messy. we can do this about 200M/second so we're good. package mdata import ( "sync/atomic" "testing" ) var V uint32 func Benchmark_Uint32(b *testing.B) { var v uint32 for i := 0; i < b.N; i++ { atomic.StoreUint32(&v, uint32(i)) } V = v } go test -run='^$' -bench=Uint Benchmark_Uint32-8 300000000 5.14 ns/op PASS ok github.com/raintank/metrictank/mdata 2.068s
the filled areas was a bit too much. the lines are tighter. also null=connected because this overflows over ~4billion very regularly
most times you import the dash (docker stack, fresh install) you want to zoom in on recent stats. Seems more common than importing a dash for an already existing setup.
9d300bc
to
a6823c5
Compare
@replay what do you think of the new graphite conn routine? i haven't subjected it to the tests outlined in #384 (comment), but will, if the code looks good to you. |
a6823c5
to
650d027
Compare
c.msgsConnections.Set(s.Connections) | ||
} | ||
}() | ||
return &c, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if there was an err
at nsq.NewConsumer()
wouldn't it probably be better to return right there, without creating a go routine that's probably going to fail once a second
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah
func NewErrMetrics(component string) ErrMetrics { | ||
return ErrMetrics{ | ||
|
||
// metric idx.cassandra.error.timeout is a counter of timeouts seen to the cassandra idx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems outdated, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we cleared this up on slack. but basically https://github.com/Dieterbe/metrics2docs is a bit simplistic, it can't parse the code very well, and as there are 2 places where we dynamically construct these metrics:
mdata/store_cassandra.go
68: errmetrics = cassandra.NewErrMetrics("store.cassandra")
idx/cassandra/cassandra.go
48: errmetrics = cassandra.NewErrMetrics("idx.cassandra")
we need to add two comments for each metric here, and there needs to be an empty line between them because otherwise metrics2docs fails
## misc ## | ||
|
||
# instance identifier. must be unique. used in clustering messages, for naming queue consumers and emitted metrics. | ||
instance = defaulti |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure, but is that a typo?
# The default matches what the Grafana dashboard expects | ||
# $instance will be replaced with the `instance` setting. | ||
# note, the 3rd word describes the environment you deployed in. | ||
prefix = metrictank.stats.defaulte.$instance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is defaulte
another typo or do i just totally not get what you did there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was because i wanted to make sure the environment and instance showed up in the right place (default.default
was a bit ambiguous). but I can undo that now and set both to default
} | ||
|
||
func (g *Gauge64) Dec() { | ||
atomic.AddUint64(&g.val, ^uint64(0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
m.max = val | ||
} | ||
c := m.hist[bin] + 1 | ||
m.hist[bin] = c |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couldn't this be m.hist[bin]++
? c
does not seem needed
for _, k := range keys { | ||
key := uint32(k) | ||
runningcount += m.hist[key] | ||
runningsum += uint64(m.hist[key]) * uint64(key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does that get multiplied with uint64(key)
? maybe a comment would be good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's the running sum. so it has to multiply the value in the bucket with the number of samples that had that value.
for { | ||
select { | ||
case <-connectTicker: | ||
conn = assureConn() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry that i keep going on about this :) but what's the purpose of still having this connectTicker
here?
- when a new message gets read from
g.toGraphite
ok
will be initialized tofalse
on line115
, so theconn
will be created byassureConn()
on117
- if during writing the connection fails then
err
will be set to!= nil
on line119
, sook
never gets set totrue
, so it will loop on thefor
at116
and reconnect at117
Why still have the ticker then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ha you're right again I think.
let me do another round of simplifying :)
func (r *Registry) add(name string, metric GraphiteMetric) GraphiteMetric { | ||
r.Lock() | ||
if _, ok := r.metrics[name]; ok { | ||
panic(fmt.Sprintf(errFmtMetricExists, name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really dont think this is an un-recoverable error. We should return an error instead.
Perhaps it would be better to use an addOrGet() method that either returns an existing graphiteMetric with the specified name or creates the new one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think AddOrGet is a good idea.
However, the pre-existing metric may be a different type. E.g. trying to register a Bool, but there's already a Meter32 or something. There's not much useful we can do in this scenario (we could keep running but without the instrumentation, which I think is pretty dangerous and we'd have to make the code more complicated to support a graphite metric type which is a no-op). So if the pre-existing one is different type, i think a panic is still warranted.
if pre-existing is the same type, then we can totally just return that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good.
made the change woodsaj requested, tested the new loop using the same experiments as #384 (comment) and it all still works as before. merged to master from commandline. |
just adding my prototype new library.
didn't do any work integrating it into metrictank yet because the code base needs to stabilize first (e.g. after merging at least the http refactor, potentially also clustering)